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Abstract 

Metric learning has been shown to be highly effective to improve the performance of nearest neighbor classi¬ 
fication. In this paper, we address the problem of metric learning for symmetric positive definite (SPD) matrices 
such as covariance matrices, which arise in many real-world applications. Naively using standard Mahalanobis 
metric learning methods under the Euclidean geometry for SPD matrices is not appropriate, because the difference 
of SPD matrices can be a non-SPD matrix and thus the obtained solution can be uninterpretable. To cope with this 
problem, we propose to use a properly parameterized LogEuclidean distance and optimize the metric with respect 
to kernel-target alignment, which is a supervised criterion for kernel learning. Then the resulting non-trivial op¬ 
timization problem is solved by utilizing the Riemannian geometry. Finally, we experimentally demonstrate the 
usefulness of our LogEuclidean metric learning algorithm on real-world classification tasks for EEG signals and 
texture patches. 


1 Introduction 


Defining a distance to compare data points is an important problem occurring in many machine learning tasks. 
For instance, in classification, the nearest neighbor method Cover & Hart ( |1967[ ) is equipped with a distance to 
identify the nearest neighbors. The performance of such a classifier highly depends on the quality of the equipped 
distance, and hence automatically finding a relevant distance from data has been the aim of many metric learning 
algorithms. So far, various methods have been developed for learning Mahalanobis distances in the Euclidean 
setting, and such metric learning methods have been successfully applied to diverse real-world problems including 
music recommendation Lim et al. 2013jl, face recognition Lu et al. (|2012J, image classification Mensink et al. 
(2012l, link prediction in networks Shaw et al. (20111 and bioinformatics Wang et al. (201 2]>. For the survey of 


recent advances, issues and perspectives in metric learning, see Bellet et al. 2013[ >. 

Classical metric learning approaches focused on an Euclidean setting which do not apply to complex or struc¬ 
tured objects. Recently, metric learning for complex objects such as histograms Cuturi & Avis ( 2014) 1; [Kedem 
et al.| (2012), binary codes Norouzi et al. (201 2) and strings Bellet et al. (|201 1|) have been actively explored, to 


which the Euclidean distance is not relevant. However, to the best of our knowledge, metric learning for symmetric 
positive definite (SPD) matrices has not been investigated thoroughly. Learning from SPD matrices, particularly 
from covariance matrices, arises in a number of important classification tasks such as brain imaging [Arsigny et al.| 
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Euclidean and Riemannian geometry of 2x2 covariance matrices 



Figure 1: Comparison between Euclidean (straight dashed lines) and Riemannian (curved solid lines) distances 
measured between points of the space 


(20061; Dryden et al. ( 2009) , brain-computer interfaces Barachant et al. ( |2010 2013[ >, pedestrian detection Tuzel 
et al. ( 2008| > and texture classification Tuzel et al. (2006 1 ; Tou et al. (2009!. In these applications, covariance ma¬ 
trices are either extracted from a physical model of the studied phenomenon (for diffusion tensor imaging) or as an 
empirical estimator from observations (for signal processing and computer-vision tasks). The purpose of this paper 
is to investigate metric learning for SPD matrices to boost the classification performance. 

space. For example, 2x2 SPD matrix A can be written as A = ° ^ 


SPD matrices belong to an Euclidear 1 


b c 

and the constraints 


with ac — b 2 > 0, a > 0 and c > 0. Then symmetric matrices can be represented as points in 
can be plotted as a cone, inside which SPD matrices lie strictly (see Fig. |T]». A straightforward approach for learning 
a metric in this space would be to simply use the Euclidean distance S e : 


$e(A, B) = \\A — B\\jr, 


( 1 ) 


where ||-||jr denotes the Frobenius norm. The Euclidean geometry of symmetric matrices implies that distances are 
computed along straight lines according to 5 e (see Fig. [T| again). 

However, the Euclidean geometry for averaging SPD matrices can result in a swelling effect Arsigny et al. ( 2007j>, 
i.e., the determinant of the average can be bigger than the determinant of each matrix. As also remarked in F etcher 


& Joshi ( 2004] ) and illustrated in Fig. [T] this geometry creates a non-complete space, meaning that interpolation 
of SPD matrices is possible, but extrapolation can produce indefinite matrices. Hence, the swelling effect and the 
non-completeness of the space can result in uninterpretable solutions. 


'Note that, in order to be an Euclidean space, the space should be equipped with the Frobenius inner product (A, B)jr = Tr(/t B) and the 
derived norm ||A||jr = ( A, A)jr. 
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To avoid this problem, we use a more natural metric to compare SPD matrices, namely, the LogEuclidean 
distance Sg. 


Si(A, B) = 11log (A) - log (B )||j 


( 2 ) 


where log(-) stands for the matrix logarithm. The LogEuclidean metric (and the derived distance and kernel) has 
been used in the literature Arsigny et al. ( 2006] ); Barachant et al. ( 2013| >, but its parameterization remained underes¬ 
timated until recently Yger ( 2013[ >. In~this paper, we propose a supervised approach to learning the LogEuclidean 
metric. More specifically, we formulate our LogEuclidean metric learning problem as the kernel-target alignment 
problem|Cristianini et al. ( |2001 [ );[Cortes et al.] (2012| , and solve the non-trivial optimization problem using the Rie- 
mannian geometry for SPD matrices. Through experiments on signal processing and computer vision applications, 
we demonstrate the efficacy of our proposed metric learning method. 


2 Proposed approach 

In this section, we describe our proposed metric learning method. 


2.1 Formulation of metric learning under LogEuclidean distance 

Let Sd be the set of real symmetric matrices of size d x d. A matrix A is said to be positive-definite (A >- 0) if 
x 1 Ax > 0 for all non-zero x £ K , and the set of d x d SPD matrices is denoted by Vd- 

To learn a metric, we need to parameterize a distance. In this paper, following Bhatia ( 2009j l, we focus on the 
congruent transform, i.e., for any A, G £ Vd , 


I V d ^ V d 
1 T i : A-> G-^AG-i. 

K G~ 2 

Thus, the LogEuclidean distance |2]i is parameterized using G £ Vd as 

5f(A,B)=6 l (r G _i (A),r G _i (B)) 

= ||log (G-^AG-i) — log (gH-BG-sW. 
Our goal is to learn the parameter matrix G from a set of n training samples. 


(3) 


{{X i ,y i )\X i £V d ,y i £{+l,~l}}^ 1 , 


so that the performance of the nearest neighbor classifier on Vd is maximally enhanced. 

G was implicitly chosen as the identity matrix in |Arsigny et al.| ( |2006| ), while G was heuristically chosen as the 
Riemannian mean of training samples in Barachant et al. ( 2013] ) and Yger ( j2013[ i. Although these choices sound 
reasonable, we argue that they are sub-optimal when Sf is used for nearest neighbor classification. Our aim is to 
find the optimal G in terms of classification performance. 
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2.2 Metric learning with kernel-target alignment 


For metric learning, we employ the (centered) kernel target alignment (KTA) criterion Cristianini et al. (2001!; 
Cortes et al.|( |2012| ): 


A(K, K*) 


(UKU, UK*U)jr 

WUKuywuK+uWr 1 


where K is a matrix to be aligned, K* = yy is a target matrix, y = (3/1 ,..., y n ) € R n , U = Id — is 

the centering matrix, l,i is the identity matrix of size d x d, and 1,/ is the d-dimensional vector with all ones. For 
simplicity, we suppose that y is centered, i.e., Uyy T U —> yy T below. 

Let k G (A, X') be the LogEuclidean metric , from which the LogEuclidean distance ([3]) is derived: 

k G (X, X') = Tr (log (g" 3 XG~ * ) log (g~ * X'G~ *)) , 


where X, X ', G £ Vd ■ Let h(G) be the gram matrix for k G : 

hij{G ) = k G {X u Xj). 

Then our metric learning problem is formulated as 


max / (G), 
Gav d y ' 


where 


/ (G) = A(h(G),yy T ) 


(Uh(G)U, yy T )r 

\\Uh(G)U\\jr 


(4) 


(5) 


A naive approach to solving the above optimization problem would be the projected gradient method, i.e., 
iteratively performing gradient ascent over /(G) and projection of the updated solution back onto Vd- However, 
since Vd is an open set (the strict interior of the cone is a set of SPD matrices, see Fig. [T| again), projection does 
not always exist. To cope with this problem, we introduce a more sophisticated approach based on the Riemannian 
geometry below. 


2.3 Riemanian Gradient Optimization 

Considering Vd as a Riemannian manifold provides us useful tools for solving our optimization problem. A basic 
optimization algorithm on the Riemannian manifold is the geodesic gradient descend which optimizes the objective 
function along the geodesic computed from the Riemannian gradient grad & ,/(G). For optimization problem ([4ji, the 
geodesic from G t to G t+ \ is given by 

G t +i =G? exp (? 7 G t _3 grad & , t /(G t )G/ 5 ^ Gf, (6) 


where 77 > 0 is the step size and exp(.) denotes the matrix exponential. Below, we explain how the above formula 
was derived. 


■ For more sophisticated optimization methods, see 


Absil et al. 

2009|and 

Boumal et al. 

2014 
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Figure 2: Mappings between the manifold of positive definite matrices and the tangent plane at the point G. As 
stated in Th. U. Sf, the Euclidean distance in TqV d and the Riemannian distance S r are related but not equal in 
general. 


As stated in Bhatia (2009), Vd is an open subset of the ambient Euclidean space of d x d symmetric matrices 
Sd■ As such, it is an instance of an embedded sub-manifold of S,i and so it is a differentiable manifold of dimension 
d(d+ 1) /2. From this structure of differential manifolds, we can derive the notion of tangent space TcVd at each 
point G £ Vd- In general, tangent spaces are identified with subspaces of the ambient space. In the current setup, as 
Vd is an open sub-manifold, we identify the tangent space at G as T G Vd = Sd, the space of symmetric matrices of 
size d x d. 

Then, in order to turn this differential manifold into a Riemannian manifold, we need to equip its tangent spaces 
with a metric. One choice of metric (leading to a complete Riemannian manifold) is the affine-invariant metric, 
which, for Sa , Sn £ T G Vd, is defined as: 


(Sa,Sb) g = Tt(G- 1 S a G- 1 S b ) ■ 


(7) 


Note that a Riemannian gradient (that is used by first-order Riemannian optimization methods) is defined with 
respect to a given tangent space. Then, in order to be coherent with the metric of the tangent space (defined in 
Eq. 0), grad G /(G), the Riemannian gradient at G can be obtained from the Euclidean gradient V/ as 

grad G /(G) = Gsym(V/(G))G. 

As detailed in Appendix |A.1| the Euclidean gradient V/ (G) can be obtained as 

V/(G) = ^Z y (G)VMG), 

ij 
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where 


Z(G) = U 


yy 


/(G) Uh(G)U 
\Uh(G)U\\jr \\Uh(G)U\\% 


U, 


(G) = X, 3 ( DQ t (G ) [sym (Ay)]) 


QiiG) 


+ Xj 3 {DQj{G) [sym (A^)]) X j 
= log (XT'GXT 


Aij = A7 X. 


i Qj(G)xfx: 

A + A T 


sym (A) = 


where Df (G) G is the directional derivative of / (G) in the direction G. To our knowledge, there is no closed- 
form formula for computing the directional derivatives DQi{G) [sym (Ay)] and DQj{G) [sym (Ay,;)]. However, 
we can numerically evaluate them, as shown in Boumal & Absil| < f2Q 111) and |Al-Mohy & Higham[ ( |2009| . 

Then any displacement in a tangent space TqPs, can be mapped back on the manifold (and vice versa) using the 
reciprocal logarithmic and exponential mappings (see Fig. [2|. Any symmetric matrix Sa belonging to TcPd ,, the 
tangent space at G, can be mapped on Vd (with the reciprocal operation) as 


A =exp G (S A ) = G5 exp (G-*S A G~ty G=, 
S A =log G (A) = G3 log (g-5AG"5) G*. 


( 8 ) 

(9) 


These formula give the basic ingredients for implementing a geodesic gradient ascent method (as illustrated in 
Fig . [ 3 ]) for our objective functiorj^J Indeed, from Eq. ([Sj, the geodesic is given by 


Gt+i = exp Gt (r/grad Gt /(G t )) , 


which leads to Eq. ©■ 


3 Discussions 

In this section, we motivate the use of the LogEuclidean distance and compare it to other distances in Vd- 

3.1 Geometries on Vd 

Depending on the type of geometry chosen, there exist several choices of distances and divergences for comparing 
SPD matrices. As already discussed in literature Arsigny et al. (2006); Pennec et al. (2006); Dryden et al. (2009); 
Fletcher & Joshi ( |2004| l; Cherian et al. ( 201 1[ ), different tools come along with various implicit invariance properties 
and computational complexities. 


3 More details about optimization on Riemannian matrix manifolds and the geometry of Vd ran be found in Absil et al. (2009 1 and in Bhatia 

(2009) . 
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Figure 3: Graphical summary of a geodesic gradient descent method for a function / on V,i- 


As explained in Bhatia ( 2009] l, when using the geometry implied by the metric 
two SPD matrices A and B is computed along a curve (called the geodesic) and 
Riemannian metric (AIRM) distance is defined as 


in Eq. [7] the distance between 
the associated affine-invariant 


6 r (A,B) = \\\og(A- 1 sBA-i)\\r 
= log 2 A l {A,B)\ 


( 10 ) 


where A,; (A, B) are the eigenvalues of the pencil (A, B). 

As illustrated in Fig. |T| when using the Riemannian geometry, the space V,i becomes 
already stated in |Arsigny et al.| ( |2007[ ), this distance is immune to the swelling effect, 
candidate for distance metric learning for covariance matrices. 

However, the AIRM distance comes along with an invariance to a class of congruent 
and it leads to the following isometry for any M £ Vd- 


a complete manifold. As 
Thus, it could be a good 


transforms |Bhatia ( 2009) 


5 r (T M (A),T M (B)) = 6 r (A,B). 


( 11 ) 


Hence, unlike the LogEuclidean distance, the AIRM distance is immune to the transform !’(■) and it can not be 
learned using our approach. 

Apart from their difference in terms of invariance properties, it should be highlighted that the AIRM distance is 
not negative-definite and then, contrary to the LogEuclidean distance, cannot be used for defining positive-definite 
kernels on SPD matrices |Haasdonk & Bahlmann| ( [2004) ; |Sra| ( |201 l) ; |Barachant et al. ( 2013 1. 

Using information geometry and extending divergences to the matrix case Cherian et al. ( 201 1) ; Sra (2011 
2012), a symmetrized LogDeterminant divergence (also called the symmetrized Stein loss) can also be used. This 


divergence can be seen as an approximation of the AIRM distance and as such, there exist some bounds between 
the AIRM distance and this symmetrized divergence |Sra| ( [2011) . Moreover, this divergence is invariant to the same 
transformation as the AIRM distance, but can lead under some conditions to a definite-positive kernel. 
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Figure 4: Illustration of the deformations implied by two mappings, respectively centered at the identity or at a point 
G 


In this work, we look for a distance on SPD matrices tailored for classification. Due to the non-completeness 
of the space and the swelling effect, the Euclidean distance is not a valid candidate. Hence, the distances derived 
from Riemannian geometries (and their approximations) are more promising candidates, but in some cases, they 
are difficult to parametrize. Using the congruent transform L g _ 1 (.), it is possible to parametrize the LogEuclidean 
distance but not the AIRM distance nor the symmetrized Stein loss due to their invariance properties. 


3.2 Interpretations of LogEuclidean metric learning 

Compared to the Euclidean and AIRM distances, the LogEuclidean distance has several advantages: it is a Rie¬ 
mannian distance (immune to the swelling effect) and is easy to compute. In fact, using the fact that the mapping 
in Eq. 0 transforms matrices from the curved space Vd to the flat space Sd, the LogEuclidean distance can be 
interpreted as the Euclidean distance between the matrices mapped into Sd- 

Interestingly, this LogEuclidean metric can be interpreted as the scalar product, 

{Si, Sj)c = Tr (G-lSiG-'SjG-i) , 


in the tangent plane of V,i at the point G, with .S', and Sj the mapping around G of X, and X ? through the logarithmic 
map. Hence, according to the taxonomy developed in Bellet et al.|< |2013) >, since we optimize G through a logarithmic 
mapping on a Riemannian manifold, our approach is an instance of non-linear similarity learning. 

In this view, the parameter G can be seen as the center of the tangent plane, and it has been empirically ob¬ 
served Yger ( 2013| l that the choice of G has a strong impact on the metric behaviour. So far, G has been heuristi- 
cally tuned using either the identity matrix or the Riemannian mean Moakher| (2005 1 ; Jeuris et al. (2012 1 of data. 
However, as highlighted later, mapping a manifold onto a tangent space implies a deformation of the data. This 
deformation can either be harmful or beneficial to classification. By optimizing the choice of G with respect to a 
discrimination criterion, we try to force the deformation to improve classification performance. As shown in Eq. 0. 
every tangent space is equipped with a different metric^] Hence, choosing a new reference for a tangent space leads 
to choosing a new scalar product. In that sense, learning the reference point of a tangent space is equivalent to 
learning a metric. 

The LogEuclidean metric is simply a scalar product (i.e., a linear kernel) applied to data mapped through a 


matrix logarithm. For this reason, it has sometimes been referred to as the LogEuclidean kernel Barachant et al. 


4 The scalar product defined at every tangent space TcVd is close to a Mahalanobis scalar product parametrized by G 


8 
























(20131; Yger ( 20l3) ; Jayasumana et al. ( |2013| . In this regard, the optimization of the LogEuclidean metric can also 
be interpreted as a kernel learning approach. 

The use of the LogEuclidean distance can be interpreted as flattening the Riemannian manifold. As discussed 
below, such a mapping implies some deformations of the space of SPD matrices (see Fig. |4]». Our approach can then 
be interpreted as finding the mapping such that the implied deformations benefit classification performance. 


3.3 Geometrical motivation 

The deformation implied by mapping points in a tangent space has been first stated as the exponential metric increas¬ 


ing (EMI) property in Bhatia (2009). It gives inequalities between the Riemannian distance S r and the Euclidean 
distance S e in the tangent space at the identity Id- 


Theorem 1. Bhatia ( 2009 ) For each pair of points A, B in V,i, we have 


6 r (A, B ) > SfiA, B ) = || log (A) - log(B)||^. 

The equality occurs when A, B and belong to the same geodesic. This property, inherited from the non¬ 
positive curvature of Vd , implies a deformation of the shapes when using the logarithmic map. 

Using the properties of the bijection T i (.) and with the notations in Eq. (|7]» and in Eq. (|9j, a corollary has 
been proposed in Yger (2013jl which extencfs Theorem[T]to any tangent space: 


Theorem 2. ( Yger 2013 For any A, B and G in Vd, with || • || G , the norm associated to the natural scalar product 
in T(jVd satisfies 

6 r (A, B) > 5? (A, B) = II log G (A) - log G (B)|| G . 

Similarly to Theorem[7] the equality occurs when A, B and G belong to the same geodesic. 

From these two theorems, as illustrated in Fig. [4] it follows that changing the reference point for centering the 
tangent space modifies how shapes are deformed once they are mapped in this tangent space. This motivates our 
metric learning approach. 

4 Numerical experiments 

In this section, we report experimental results. 


4.1 Setup 


Concerning the implementation of our approach, we employed a Riemannian trust-regioi 5 Boumal & Absil (2011); 
Absil et aL]f2009| >. In practice, we use the following regularized optimization problem: 


f max /(G) 

J G&v d 

[s.t. S r (Go,G)<e 


for some Go- As we have not found any guarantee concerning the convexity or the geodesic-convexity Wiesel|p012 1 
of the centered KTA, we decided to use the Riemannian mean |Moakher| ( |2Q05| >; peuris et al. ( |2012[ l as Go- Below, 
we set e = 10 for all datasets. 


5 with the implementation provided in the Manopt toolbox provided by Boumal et al. (2014 . 
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Table 1: Comparison of the mean accuracy of the LogEuclidean distance (parameterized by our method Gkta) to 
other distances on a toy dataset with growing sizes of covariance matrices. 


size 

LogEuclidean 

S r 

6 e 

Id 

X 

Gkta 

6x6 

77.50 

77.58 

84.78 

77.66 

82.76 

8x8 

77.46 

77.44 

86.78 

77.42 

83.32 

16x16 

75.58 

75.62 

89.56 

75.48 

85.78 

20x20 

77.06 

77.10 

90.96 

77.16 

87.16 


In our numerical experiments, we report accuracies on balanced datasets using a 1-nearest neighbor (1-NN) 
classifier equipped with different distances. Our baseline distances for the 1-NN classifier comprise the Euclidean 
distance S e , the Riemannian distance S r and the LogEuclidean distance (parameterized by the identity matrix or the 
Riemannian mean of the training samples). We compare all these baseline distances to the LogEuclidean distance 
df with parameter G learned byour approach. In this section, when the reported results are averaged over several 
iterations (as in Tab.[l]and Tab. [ 3 }, we show when the improvement is significan {fusing a bold font. 

4.2 Toy dataset 

First, in order to gain some insights on the behavior of our approach, we designed a simple experiment in which, the 
covariance matrices are generated as follows: 

A. — Qdiag(Ai,..., A r , /ri,..., /r r )Q 
+ Vdiag(|ei|,...,|e 2r -|)V T 

with Q and V, two random orthonormal square matrices, and A^ ~ 7V(5, 0.2) or Aj ~ A/"(4, 0.1) respectively for the 
positive and negative classes and /j, ~ W([l, 6]) and ~ Af(0, 1) independent of the class. The dataset is composed 
of 50 samples in the training set and 500 samples in the test set. It is re-sampled in order to repeat the experiment 
10 times and the results are reported in Tab. [I] 

In this simple experiment, as it was foreseeable, the Euclidean distance achieves good performances. Indeed, 
omitting the additive noise matrix, the data generation protocol only involves a fixed orthonormal matrix Q for 
generating the matrices from both classes. Hence, the separation between the classes just becomes linear in terms of 
the eigenvalues of the covariance matrices. Interestingly, the Riemannian distances (except our approach) performs 
badly for all cases. 

From this experiment, it is interesting to note that the proposed metric learning method helps the LogEuclidean 
distance to overcome its limitation.. Hence, the LogEuclidean distance with parameter learned by our proposed 
method is the best not only among the Riemannian distances but among all candidates. However, this toy experiment 
may be too simple for our applications and we test our methods on real-life datasets below. 


4.3 Brain Computer-Interface 


We tested our approach on Dataset 2a from the BCI Competition IvQ Naeem et al. (2006). The dataset consists 
of EEG signals (recorded from 22 electrodes) on 9 subjects who were asked to perform left hand, right hand, foot 


6 by comparing the two best results with the two-sided Wilcoxon signed rank test at significance level 0.05. 

'http://www.bbci.de/competition/iv/ 
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Figure 5: Illustration of the effect of whitening when the Riemannian mean of the data is non-stationary. 


and tongue motor imagery Pfurtscheller & Neuper (2001) (i.e., EEG signals are recorded while the subject is only 
imagining the given limb movements without actually moving it). 

Following the protocol described in Lotte & Guan ( j2011j >. we only used the EEG signals corresponding to the 
left and right hand. Hence, for every subject, we have 72 trials (i.e. segments of signal) for each class in both 
training and testing. Moreover, we applied the same band-pass filtering ([8 — 30] Hz) in order to pre-process the raw 
signals. 

As shown in literature Barachant et al.| ( |2010| |2013] >; |Yger| ( |2013] >, for motor imagery signals, the spatial co- 
variance matrix of the trials is a very promising feature. Based on this observation, we extracted the covariance 
matrix between sensors (i.e., the spatial covariance matrix) from every filtered time segment (i.e., every example to 
be classified). 

Although we could directly use our approach and try to classify samples, the performance would be poor since 
this dataset is known to show some non-stationarity between sessions. In order to cope with this problem, we em¬ 
ploy an adaptive kernel formulation proposed in Barachant et al!]( j2013| . This method cancels the non-stationarity 
in the data by whitening the covariance matrices in the training session and test session by subtracting the Rieman¬ 
nian mean of the training and test datasets, respectively. In practice, whitening is carried out by first applying the 
congruent transformation E __i (•), where X is the Riemannian mean of the data set. As illustrated in Fig. |5j this 
whitening process has the effect of spreading the covariance matrices from both training and test dataset around the 
identity matrix. For a dataset spread around X, this has the effect of “transporting” the data on the manifold along 
the geodesic. For real-life applications, this whitening process can be criticized since it uses the test data. However, 
this can be carried out in a semi-supervised fashion by estimating X~ 5 during the calibration phase. Therefore, this 
process would still be practical in many applications. 

As reported in Tab. [2j compared to other distances for handling covariance matrices, our proposed approach 
performs the best on 5 subjects and is never the worst. On average over the subjects, we observe a significant gain 
by using an optimized reference point for the LogEuclidean distance. 


4.4 Texture classification 

As another example of real-world problems, we consider a subset of four textures (shown in Fig [6]) of the Brodatz 
dataset Brodatz (1966). Similarly to the works of Mairal et al. (2009) and Yger & Rakotomamonjy ( |2011 1 , we 
extracted 16 x 16 patches from every texture. For every couple of textures, the training set composes patches from 
the left half of each texture and the test set composes patches from the right half. In every set, patches may overlap, 
but the training and test sets do not overlap each other. 
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Table 2: Comparison of the accuracy of the LogEuclidean distance (parameterized by our method Gkta) to other 
distances for the classification of hand movements for 9 subjects of the BCI Competition IV Dataset 2a. 


subject 

LogEuclidean 

S r 

Se 

Id 

Gkta 

1 

80.56 

83.33 

79.17 

77.08 

2 

58.33 

54.17 

59.03 

52.78 

3 

93.06 

95.83 

93.75 

88.19 

4 

54.17 

54.17 

52.78 

56.25 

5 

52.78 

55.56 

52.78 

53.47 

6 

63.89 

61.81 

62.50 

60.42 

7 

52.78 

60.42 

52.78 

61.11 

8 

93.75 

97.92 

92.36 

92.36 

9 

83.33 

84.72 

82.64 

84.72 

mean 

70.29 

71.99 

69.75 

69.60 


Table 3: Comparison of the mean accuracy of the LogEuclidean distance (parameterized by our method Gkta) to 
other distances for classifying patches of texture taken from the Brodatz dataset. 


textures 

LogEuclidean 

<5 r 

Se 

Id 

X 

Gkta 

D2 vs D28 

86.37 

86.58 

88.39 

86.31 

63.53 

D2 vs D29 

75.5 

75.33 

76.69 

75.14 

64.03 

D2 vs D92 

74.88 

74.68 

76.71 

74.33 

68.05 

D28 vs D29 

75.5 0 

75.33 

76.69 

75.14 

64.03 

D28 vs D92 

74.88 

74.68 

76.71 

74.33 

68.05 

D29 vs D92 

83.68 

83.53 

85.83 

83.48 

65.93 

mean 

78.47 

78.36 

80.17 

78.12 

65.60 


Every method was trained on 50 patches selected from the training set and tested on 300 patches selected from 
the test set. Classification rates have been computed as the average of 50 runs after re-sampling of the training and 
test sets. 

As proposed in Tou et al. ( 2009} , we built covariance matrices using a bank of 8 Gabor filters (4 angles and 2 
scales). Hence, each 16 x 16 patch is transformed into a 8 x 8 covariance matrix. 

Tab. [3] summarizes the experiments on texture classification. First, it clearly shows the interest of using the 
Riemannian geometry. Indeed, for this data, the Euclidean distance performs poorly. Moreover, this experiment 
shows again that our supervised metric learning algorithm for choosing the reference of a LogEuclidean distance is 
effective. 


5 Conclusion and perspectives 

In this paper, we have introduced a novel approach for selecting a LogEuclidean metric. This approach is founded 
on theoretical observations about the distortion implied by a LogEuclidean metric. Then, we proposed to cast the 
problem of choosing a relevant metric as a kernel learning algorithm with a kernel target alignment criterion. We 
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Figure 6: Selected textures from the Brodatz dataset 


finally solved this problem using tools from the field of optimization on manifold and applied it to synthetic and 
real-world data. 

Although we restricted ourselves to binary classification problems, the extension of our approach to multi-class 
problems [Ramona et al. ( 201 2\ is straightforward. 

When huge datasets are involved, our optimization algorithm might not scale well. A stochastic setting on 
manifold |BonnabeI ( 2013| could be a promising extention. 

We only considered full-rank covariance matrices, but the extension of our approach to low-rank matrices is a 
challenging and interesting future work, since this corresponds to supervised feature extraction under the LogEu- 
clidean metric. Exploration along this line of research would bridge the gap between our approach and the one 
proposed in jHarandi et al.[ ( j20T4| . 


References 

Absil, P-A, Mahony, Robert, and Sepulchre, Rodolphe. Optimization algorithms on matrix manifolds. Princeton 
University Press, 2009. 

Al-Mohy, Awad H and Higham, Nicholas J. Computing the frechet derivative of the matrix exponential, with an 
application to condition number estimation. SIAM Journal on Matrix Analysis and Applications, 30(4): 1639- 
1657, 2009. 


13 













Arsigny, Vincent, Fillard, Pierre, Pennec, Xavier, and Ayache, Nicholas. Log-euclidean metrics for fast and simple 
calculus on diffusion tensors. Magnetic resonance in medicine , 56(2):411-421, 2006. 

Arsigny, Vincent, Fillard, Pierre, Pennec, Xavier, and Ayache, Nicholas. Geometric means in a novel vector space 
structure on symmetric positive-definite matrices. SIAM journal on matrix analysis and applications, 29(1): 
328-347, 2007. 

Barachant, Alexandre, Bonnet, Stephane, Congedo, Marco, and Jutten, Christian. Riemannian geometry applied to 
bci classification. In LVA/ICA, pp. 629-636, 2010. 

Barachant, Alexandre, Bonnet, Stephane, Congedo, Marco, and Jutten, Christian. Classification of covariance 
matrices using a riemannian-based kernel for bci applications. Neurocomputing, 112:172-178, 2013. 

Bellet, Aurelien, Habrard, Amaury, and Sebban, Marc. Learning good edit similarities with generalization guar¬ 
antees. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in 
Databases (ECML/PKDD), pp. 188-203. Springer, 2011. 

Bellet, Aurelien, Habrard, Amaury, and Sebban, Marc. A survey on metric learning for feature vectors and structured 
data. arXivpreprint arXiv:1306.6709, 2013. 

Bhatia, Rajendra. Positive definite matrices. Princeton University Press, 2009. 

Bonnabel, Silvere. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control, 
58(9):2217-2229, Sept 2013. 

Boumal, Nicolas and Absil, P-A. Discrete regression methods on the cone of positive-definite matrices. In IEEE 
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4232^-235. IEEE, 2011. 

Boumal, Nicolas, Mishra, Bamdev, Absil, P-A, and Sepulchre, Rodolphe. Manopt, a matlab toolbox for optimization 
on manifolds. The Journal of Machine Learning Research (JMLR), 15(1): 1455-1459, 2014. 

Brodatz, Phil. Textures: a photographic album for artists and designers. Dover Pubns, 1966. 

Cherian, Anoop, Sra, Suvrit, Banerjee, Arindam, and Papanikolopoulos, Nikolaos. Efficient similarity search for 
covariance matrices via the jensen-bregman logdet divergence. In IEEE International Conference on Computer 
Vision (ICCV), pp. 2399-2406. IEEE, 2011. 

Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Algorithms for learning kernels based on centered 
alignment. The Journal of Machine Learning Research (JMLR), 13:795-828, 2012. 

Cover, Thomas and Hart, Peter. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 
13(1):21—27, 1967. 

Cristianini, Nello, Shawe-Taylor, John, Elisseeff, Andre, and Kandola, Jaz. On kernel target alignment. In NIPS, 
volume 2, pp. 4,2001. 

Cuturi, Marco and Avis, David. Ground metric learning. The Journal of Machine Learning Research (JMLR), 15 
(l):533-564, 2014. 

Dryden, Ian L, Koloydenko, Alexey, and Zhou, Diwei. Non-euclidean statistics for covariance matrices, with 
applications to diffusion tensor imaging. The Annals of Applied Statistics, pp. 1102-1123, 2009. 


14 



Fletcher, P Thomas and Joshi, Sarang. Principal geodesic analysis on symmetric spaces: Statistics of diffusion 
tensors. In Computer Vision and Mathematical Methods in Medical and Biomedical Image Analysis, pp. 87-98. 
Springer, 2004. 

Haasdonk, Bernard and Bahlmann, Claus. Learning with distance substitution kernels. Pattern Recognition, pp. 
220-227, 2004. 

Harandi, Mehrtash, Salzmann, Mathieu, and Hartley, Richard. From manifold to manifold: geometry-aware dimen¬ 
sionality reduction for spd matrices. In Proceedings of the European Conference on Computer Vision (ECCV), 
pp. 17-32, 2014. 

Jayasumana, Sadeep, Hartley, Richard, Salzmann, Mathieu, Li, Hongdong, and Harandi, Mehrtash. Kernel methods 
on the riemannian manifold of symmetric positive definite matrices. In IEEE Conference on Computer Vision 
and Pattern Recognition (CVPR), pp. 73-80, 2013. 

Jeuris, Ben, Vandebril, Raf, and Vandereycken, Bart. A survey and comparison of contemporary algorithms for 
computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 39:379^-02, 2012. 

Kedem, Dor, Tyree, Stephen, Sha, Fei, Lanckriet, Gert R, and Weinberger, Kilian Q. Non-linear metric learning. In 
NIPS, pp. 2573-2581, 2012. 

Lim, Daryl, Lanckriet, Gert, and McFee, Brian. Robust structural metric learning. In Proceedings of the 30th 
International Conference on Machine Learning (ICML), pp. 615-623, 2013. 

Lotte, Fabien and Guan, Cuntai. Regularizing common spatial patterns to improve bci designs: unified theory and 
new algorithms. IEEE Transactions on Biomedical Engineering, 58(2):355-362, 2011. 

Lu, Jiwen, Hu, Junlin, Zhou, Xiuzhuang, Shang, Yuanyuan, Tan, Yap-Peng, and Wang, Gang. Neighborhood re¬ 
pulsed metric learning for kinship verification. In IEEE Conference on Computer Vision and Pattern Recognition 
(CVPR), pp. 2594-2601, June 2012. 

Mairal, Julien, Ponce, Jean, Sapiro, Guillermo, Zisserman, Andrew, and Bach, Francis. Supervised dictionary 
learning. In NIPS, 2009. 

Mensink, Thomas, Verbeek, Jakob, Perronnin, Florent, and Csurka, Gabriela. Metric learning for large scale im¬ 
age classification: Generalizing to new classes at near-zero cost. In European Conference on Computer Vision 
(ECCV), pp. 488-501. Springer, 2012. 

Moakher, Maher. A differential geometric approach to the geometric mean of symmetric positive-definite matrices. 
SIAM Journal on Matrix Analysis and Applications, 26(3):735-747, 2005. 

Naeem, M, Brunner, C, Leeb, R, Graimann, B, and Pfurtscheller, G. Seperability of four-class motor imagery data 
using independent components analysis. Journal of neural engineering, 3(3):208, 2006. 

Norouzi, Mohammad, Blei, David M, and Salakhutdinov, Ruslan R. Hamming distance metric learning. In NIPS, 
pp. 1061-1069, 2012. 

Pennec, Xavier, Fillard, Pierre, and Ayache, Nicholas. A riemannian framework for tensor computing. International 
Journal of Computer Vision, 66(l):41-66, 2006. 


15 



Pfurtscheller, Gert and Neuper, Christa. Motor imagery and direct brain-computer communication. Proceedings of 
the IEEE, 89(7): 1123-1134, 2001. 

Ramona, Mathieu, Richard, Gael, and David, Bertrand. Multiclass feature selection with kernel gram-matrix-based 
criteria. IEEE Transactions on Neural Networks and Learning Systems, 23(10): 1611-1623, 2012. 

Shaw, Blake, Huang, Bert, and Jebara, Tony. Learning a distance metric from a network. In NIPS, pp. 1899-1907, 

2011 . 

Sra, Suvrit. Positive definite matrices and the s-divergence. arXiv preprint arXiv: 1110.1773, 2011. 

Sra, Suvrit. A new metric on the manifold of kernel matrices with application to matrix geometric means. In NIPS, 
pp. 144-152, 2012. 

Tou, Jing Yi, Tay, Yong Haur, and Lau, Phooi Yee. Gabor filters as feature images for covariance matrix on texture 
classification problem. In Advances in Neuro-Information Processing, pp. 745-751. Springer, 2009. 

Tuzel, Oncel, Porikli, Fatih, and Meer, Peter. Region covariance: A fast descriptor for detection and classification. 
In European Conference on Computer Vision (ECCV), pp. 589-600, 2006. 

Tuzel, Oncel, Porikli, Fatih, and Meer, Peter. Pedestrian detection via classification on riemannian manifolds. IEEE 
Transactions on Pattern Analysis and Machine Intelligence, 30(10): 1713—1727, 2008. 

Wang, Jingyan, Gao, Xin, Wang, Quanquan, and Li, Yongping. Prodis-contshc: learning protein dissimilarity 
measures and hierarchical context coherently for protein-protein comparison in protein database retrieval. BMC 
bioinformatics, 13(Suppl 7):S2, 2012. 

Wiesel, Ami. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12): 
6182-6189, 2012. 

Yger, Florian. A review of kernels on covariance matrices for bci applications. In IEEE International Workshop on 
Machine Learning for Signal Processing (MLSP), pp. 1-6. IEEE, 2013. 

Yger, Florian and Rakotomamonjy, Alain. Wavelet kernel learning. Pattern Recognition, 44(10):2614—2629, 2011. 


A Appendix 

In the appendix, we give more details about the derivation of the our cost function / and we give the results of an 
extra numerical experiment on a toy dataset. 


A.l Gradient of f(G) 


Deriving the cost function / using the classical tricks of the trade of matrix derivation is not obvious (if not impos¬ 
sible). Indeed, this function is based on the matrix logarithm whose derivative is not simple to obtain. 

To obtain the Euclidean gradient V/ of the function /, we need Df(X) [H], its directional derivativ^jat a point 
X and in a direction H, which is defined as: 


Df(X)[H\ = lim 


f{X + hH) - f{X) 
h 


8 Note that it is sometimes called Frechet derivative. 
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In order to obtain the gradient of /, we express its directional derivative. As the the gradient and the directional 
derivatives are linked with the following equality 

Df(G) [g! = (v/(G),G 


we need the adjoint of the directional derivative in order to obtain V/ (G). 

Using standard properties of the directional derivatives, we formulate Df (G) 


G 


as : 


Df(G) 


G 


(UDh (G) G U, yy T )^\\Uh (G) uy - (Uh (G) U, yy T M „ UDh ( G ) 6 U > 


T 


\\Uh(G)U\\^ 


= ( (G) 


G 




(||C//i(G)C/|| 


yy 


T 


f (G) J7/1 (G) E7 

ll^(G)c/||2 r 


u 


( 12 ) 

(13) 


Z(G) 


.F 


In Eq. 13| the core quantity to estimate is Dh(G). Hopefully, the function h(G ) has already been studied 


Boumal & Absil 


of 




(2011). Using those results with the same convention (sym (A) = 


A+A 


* [d log (x< h GX l A [sym (x/x^. h log (X ? l GX j A xjx. l 2 


Vh ij (G)=X i 

+ X” * (d log (x7 5 GX~ ®) [sym (x”*X? log (x“ * GX 


2 ), we have the gradient 

_ 1 \ 1 \ _ 1 

(14) 


X, 


X, 2 XJ 


x; 


Note that the gradient of V/ijj depends on the directionnal derivative of the matrix logarithm. To our knowledge, 
there is no closed-form formula of this directionnal derivative but is can be evaluated numerically using the algorithm 
in (Boumal & Absil 201 1[ Al-Mohy & Higham} 2009| i. 

Then, the Euclidean gradient of / at G is expressed with respect to the adjoint of Dh (G): 


V/ (G) = (Dh (G))* [Z(G)\. 

Hence, after some algebra (detailed in the next section), we express the adjoint as : 

{Dh{G)Y[Z{G)} 

' Z,Dh(G) 


G 

L J IT 


— ( ZijXhij (G), G 


T 


To sum up, using Eq[l4]and EqjT3] we have : 

V/ (G) = ^2 ZijXhij (G). 


(15) 


(16) 


(17) 


Note that, before solving our optimization problem on data, the validity of our Euclidean gradient formula has 
been numerically checked using the sanity checking tools provided in the Manopt toolbox by jBoumal et al.|(2014)>. 
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Table 4: Second experiment on a toy dataset with a weaker level of noise information. 


size 

LogEuclidean 

S r 


Id 

X 

Gkta 

6 x6 

83.22 

83.00 

89.10 

83.06 

91.56 

8 x8 

85.10 

85.00 

91.90 

84.98 

94.82 

16x16 

89.38 

89.54 

96.88 

89.46 

98.28 

20 x20 

92.72 

92.74 

98.60 

92.74 

99.42 


A.2 Adjoint of Dh ( G ) 

In order to obtain the adjoint, using the decomposition over the canonical basis G = J2 k l GkjCkf-i T , we write : 


Z, Dh (G) 


G 


T 


G 

L J / IJ 


= E^(^M G ) 

ij 

= E( v MG),g)^ 

ij 

= E E ( v/l « ( G ) - e fc e ' T )^ 


kl ij 


- E E^ v/, '/ (G 'I G « 

kl \ ij 


fcZ 


(Dh(G))*[Z(G)] 


= ( E ZijVhij (G), G 


r 


(18) 

(19) 

( 20 ) 
( 21 ) 

( 22 ) 

(23) 


A.3 Extra numerical experiments 

We also report the numerical experiments of a simpler toy experiment. In this experiment, the data are generated in 
the same way : 


X —Qtlitt§(Ai,..., A r , ..., i*ir)Q 
+Ediag(|ei|,...,|e 2r |)E T 

with Q and V, two random orthonormal square matrices, and A t ~ A/"(5,0.2) or Aj ~ A/"(4, 0.1) respectively for 
the positive and negative classes and /j,; ~ G([l, 3]) and e 7 ; ~ A/"(0,1) independently of the class. 

The difference between those two experiments resides in the range in which the variables /j. ( are sampled. Here, 
the range of the uniform distribution is smaller and this toy dataset becomes easier to classify. 

The datasets were composed of 50 examples in the training set and 500 in test and every experiment was repeated 
10 times. 
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In this simple experiment, as in the other toy experiment, the Euclidean distance achieves the best performances 
for the same reasons. 

Even if the Euclidean distance seems to be the most suited to those data, this toy experiment is interesting. 
Indeed, among the distances respecting the Riemannian geometry of the data (namely the LogEuclidean and Rie- 
mannian distances), the LogEuclidean distance parametrized by our approach performs the best. Optimizing the 
reference point G (with respect to a KTA criterion) seems to diminish the impact of the uninformative variables //. 

Note that in this experiment, the results of all approaches (including the Riemannian approaches) improves as 
the dimensionality grows. Contrary to the results reported in[T] in this experiment, the level of information is higher 
than the level of noise in the data. Hence, the more dimension we add, the easier it becomes to classify. 
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